In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset. It contains over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art VQA models, including multi-modal compact bilinear pooling (MCB), neural module networks, and recurrent answering units. Our experiments establish how attention helps certain categories more than others, determine which models work better than others, and explain how simple models (e.g., an MLP) can surpass more complex models (e.g., MCB) simply by learning to answer large, easy question categories.